{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 通过聚类识别登陆养卡群体"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Silhouette Coefficient 轮廓系数\n",
    "* 轮廓系数 <br>\n",
    "轮廓系数适用于实际类别信息未知的情况。对于单个样本，设$a$是与它同类别中其他样本的平均距离（类似类内局），$b$是与它距离最近不同的类别中样本的距离（类似类间距）<br>\n",
    "* $$s = \\frac{b-a}{max(a,b)}$$\n",
    "<br>\n",
    "对于一个样本集合，它的轮廓系数是所有样本轮廓系数的平均值。轮廓系数的取值范围[-1,1],同类别样本距离越相近并且不同样本类别距离越远，轮廓系数的得分越大.<br>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.cluster import KMeans\n",
    "from sklearn import metrics\n",
    "kmeans_model = KMeans(n_clusters=3,random_state=0).fit(X)\n",
    "labels = kmeans_model.labels_\n",
    "print(metrics.silhouette_score(X,labels,metric='euclidean'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Calinski-Harabaz Index\n",
    "这个计算简单直接，简单的来说 Calinski-Harabaz 分值越大则说明聚类效果越好。<br>\n",
    "$$s(k)=\\frac{tr(B_k)}{tr(W_k)}\\times\\frac{m-k}{k-1}$$\n",
    "<br>其中$m$表示集群样本数,$k$表示类别数,$B_k$为类别之间的协方差矩阵,$W_k$类别内部样本的协方差,$tr()表示矩阵的迹$，也就是说类内样本协方差小，类间样本协方差大，则 Calinski-Harabaz 分值高。在scikit-learn中， Calinski-Harabasz Index对应的方法是metrics.calinski_harabaz_score.\n",
    "\n",
    "$$W_k = \\sum\\limits_{q=1}^k{\\sum\\limits_{x \\in C_q}(x-c_q)(x-c_q)^T}$$\n",
    "<br>\n",
    "$$B_k=\\sum\\limits_q n_q(c_q-c)(c_q-c)T$$\n",
    "<br>\n",
    "$C_q$为cluster q中的点集，$c$为E的中心，$n_q$为cluster q中的点数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import metrics\n",
    "from sklearn.metrics import pairwise_distances\n",
    "from sklearn import datasets\n",
    "import numpy as np\n",
    "from sklearn.cluster import KMeans\n",
    "data_set = datasets.load_iris()\n",
    "X = data_set.data\n",
    "y = data_set.target\n",
    "kmeans_model = KMeans(n_clusters=3,random_state=0).fit(X)\n",
    "labels = kmeans_model.labels_\n",
    "ch_score = metrics.calinski_harabaz_score(X,labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pylab as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.cluster import DBSCAN\n",
    "from mpl_toolkits.mplot3d import Axes3D\n",
    "from matplotlib.ticker import MultipleLocator,FormatStrFormatter\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 连接hive 数据库"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyhive import hive\n",
    "cursor = hive.connect(host='192.168.1.106',post=10000,\n",
    "                username='root',database='zjyd3').cursor()\n",
    "cursor.execute('select * from temp_wqf_log_info_04')\n",
    "a = cursor.fetchall()\n",
    "alldata = pd.DataFrame(a,columns=['phonenumber','sl','ips_pro','imeis_pro','timecv','fail_pro'])\n",
    "del a "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 数据标准化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import MinMaxScaler\n",
    "train_data = alldata[['sl','ips_pro','imeis_pro','timecv']]\n",
    "train_data.fillna(0,inplace=True)\n",
    "scaler = MinMaxScaler()\n",
    "train_data = scaler.fit_transform(train_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 模型选择优化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.cluster import KMeans\n",
    "from sklearn.metrics import silhouette_score\n",
    "from sklearn.metrics import calinski_harabaz_score\n",
    "n_class = []\n",
    "ch_values = []\n",
    "for i in range(10):\n",
    "    n_cluster = i + 2\n",
    "    print(n_class)\n",
    "    clf = KMeans(n_clusters=n_cluster,random_state=0)\n",
    "    label = clf.fit_predict(train_data)\n",
    "    ch_values.append(calinski_harabaz_score(train_data,label))\n",
    "    n_class.append(n_cluster)\n",
    "plt.plot(n_class,ch_values,label='First line',linewidth=3,\n",
    "         color='r',marker='o',markerfacecolor='blue',\n",
    "         markersize=12)\n",
    "ax = plt.subplot(111)\n",
    "x_major_Locator = MultipleLocator(1)\n",
    "x_major_Formatter = FormatStrFormatter('%5.0f')\n",
    "ax.xaxis.set_major_locator(x_major_Locator)\n",
    "ax.xaxis.set_major_formatter(x_major_Formatter)\n",
    "plt.xlabel('n_class')\n",
    "plt.ylabel('ch values')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
